fix: optimize performance for platform-wide querying #23
As we began running performance tests against the activity apiserver, we noticed that platform-wide queries were performing drastically worse than tenant-level queries. This was a result of our initial schema being designed to order data by tenant, which meant platform-wide queries had to scan the entire data set instead of being able to skip over irrelevant rows. This change makes several adjustments to the schema to improve querying performance of the Clickhouse database.

- Moved to daily partitions so that partitions can be TTL'd each day instead of only when the month is over. Since all queries are time-bound, this also means each query needs to scan fewer partitions.
- Removed unnecessary skip indexes on fields already present in the ordering of the data. Skip indexes won't provide much performance benefit when the ordering is used.
- Moved to the ReplacingMergeTree table engine to ensure that all audit logs are unique. Removing duplicates is a background operation, so users _may_ see duplicates if a merge operation hasn't been performed yet. We'll mitigate this in the collection pipeline by putting guardrails in place to prevent duplicates from being sent to Clickhouse.
- Added indexes for fields used in common querying patterns to help skip over irrelevant rows.
- Created new projections designed to efficiently query audit logs across all tenants. The platform-wide query projection supports platform administrators querying across all tenants. Queries will be most performant when they filter by a specific api group and resource, which will be the most common querying pattern for cross-tenant queries. Also introduces a query projection for user-specific queries to help platform administrators query for all audit logs related to a specific user.

I've modified the `001_initial_schema.sql` migration instead of adding a new migration because this service has not been released yet.
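Taken together, the changes above look roughly like the following sketch. This is illustrative only: the authoritative definition lives in `migrations/001_initial_schema.sql`, and the column set, projection name, and TTL interval here are assumptions.

```sql
-- Illustrative sketch; real schema is in migrations/001_initial_schema.sql.
-- Columns beyond the ORDER BY key and the TTL interval are assumptions.
CREATE TABLE audit_logs
(
    timestamp  DateTime64(3),
    scope_type LowCardinality(String),
    scope_name String,
    user       String,
    audit_id   String,
    api_group  LowCardinality(String),
    resource   LowCardinality(String),
    -- Projection serving platform-wide queries by api group / resource.
    PROJECTION by_api_group_resource
    (
        SELECT * ORDER BY api_group, resource, timestamp
    )
)
ENGINE = ReplacingMergeTree          -- duplicates collapse during merges
PARTITION BY toDate(timestamp)       -- daily partitions, expired day by day
ORDER BY (timestamp, scope_type, scope_name, user, audit_id)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;
```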
I've also removed `stage` from the schema and the querying interface since we're only collecting the `ResponseComplete` stage from the system. The apiserver has also been adjusted to intelligently change the `ORDER BY` used when querying Clickhouse so that projections are used based on the query being performed by the end user. Lastly, the performance tests have been updated to better reflect real-world querying behavior, where the api group / resource are present in the queries. See: https://clickhouse.com/docs/data-modeling/projections
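As an example of the query shape this targets (table and column names assumed), a platform-wide query that filters on api group and resource lines up with the cross-tenant projection's ordering, so ClickHouse can serve it from the projection instead of scanning the base ordering:

```sql
-- Column names are assumptions; the point is that the api group / resource
-- predicates match the projection's leading key columns.
SELECT *
FROM audit_logs
WHERE api_group = 'apps'
  AND resource = 'deployments'
  AND timestamp >= now() - INTERVAL 24 HOUR
ORDER BY timestamp DESC
LIMIT 100;
```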
We need to configure the merge behavior of projections since we swapped over to the replacing merge tree engine.
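One way to do that (a sketch, table name assumed) is ClickHouse's MergeTree setting for projection behavior during deduplicating merges:

```sql
-- 'rebuild' recomputes projections whenever a deduplicating merge runs,
-- keeping them consistent with the deduplicated parts; 'drop' and 'throw'
-- are the other supported modes.
ALTER TABLE audit_logs
    MODIFY SETTING deduplicate_merge_projection_mode = 'rebuild';
```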
This change adjusts the NATS stream configuration to support a 10-minute de-duplication window. The NATS message ID has been set to the audit log ID since the ID will be unique across all audit logs.
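In JetStream terms this corresponds to a stream configuration along these lines (the stream and subject names are made up for illustration; `duplicate_window` is expressed in nanoseconds in the JSON API, so 10 minutes is 600e9):

```json
{
  "name": "AUDIT_LOGS",
  "subjects": ["audit.logs.>"],
  "duplicate_window": 600000000000
}
```

Publishers then set the `Nats-Msg-Id` header to the audit log ID, and JetStream drops any message whose ID has already been seen within the window.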
Have to enable JetStream to take advantage of the message_id option.
migrations/001_initial_schema.sql
```sql
-- Primary key optimized for tenant-scoped queries
-- Deduplication occurs on the full ORDER BY key during merges
ORDER BY (timestamp, scope_type, scope_name, user, audit_id)
```
Talked with @drrev about this Clickhouse schema and he mentioned we should try adjusting this to use the following schema change and see if there's any additional performance gains.
```sql
ORDER BY (toStartOfHour(timestamp), scope_type, scope_name, user, audit_id, timestamp)
PRIMARY KEY (toStartOfHour(timestamp), scope_type, scope_name, user, audit_id)
```

Maybe even use toStartOfMinute depending on volume of data.
I ran a performance test against an updated schema that uses the toStartOfHour() for the order by / primary key to group events into 1 hour buckets.
The performance test completed successfully with no errors.
We were able to top out at 96.5 RPS with a P99 response time of 2.5 seconds.
Pushed a change in a6ce224 that adopts the recommended schema changes and apiserver pagination implementation.
Thanks @drrev for the recommendation!
How big was the dataset in the tests? For Clickhouse it seems like 60 RPS, especially for reads, is several orders of magnitude below what it should be, even with billions of documents (which I'm assuming we didn't have...).
@drewr the system had around 77 million rows in it, spread across ~3 days. Most of the queries in the performance test would have been scanning the last 24 hours, so should be looking at ~25 million rows per query.
The current schema uses millisecond-precision timestamps as the primary key's leading column. For a multi-tenant audit logging system ingesting events from many Kubernetes control planes, this causes:

- Poor data locality: Events from the same tenant arriving at slightly different times are scattered across granules
- Suboptimal compression: Similar events aren't co-located, reducing compression effectiveness
- Inefficient deduplication: Duplicate events within the same hour aren't consistently grouped
- Higher storage costs: More granules and lower compression ratios increase disk usage

Hour bucketing addresses these issues by clustering events from the same tenant/scope/user within hour boundaries, aligning with how audit logs are queried (typically 1h, 24h, or 7d time ranges).
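The bucketing only changes the key, not what callers see. A sketch (table and column names assumed): queries still filter on the raw timestamp, and ClickHouse prunes by hour bucket first before comparing the remaining key columns.

```sql
-- toStartOfHour(timestamp) leads the key, so this 24-hour window touches
-- roughly 24 hour-buckets; within each bucket rows are further ordered by
-- scope, user, and audit_id, letting duplicates collapse on merge.
SELECT count()
FROM audit_logs
WHERE timestamp >= now() - INTERVAL 24 HOUR
  AND scope_type = 'tenant'
  AND scope_name = 'example-tenant';
```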
ecv left a comment:
PR v.large, approving on trust
## Summary

This PR makes additional improvements to platform-wide querying performance, adjusts the Clickhouse audit log schema to use the correct timestamp for the request, adds support for querying by the user's UID, and adjusts the user-scoped projection to use the user's UID value instead of the username.

## Details

- **Filter by User's UID** - Filtering by UID can be valuable to filter down to a specific user using a stable identifier instead of an email, which can be changed by the user. UIDs are also only in place for users of the platform; internal components that authenticate with certificates do not have UIDs. This gives us a clean way of filtering out internal components from audit logs.
- **Request Received Timestamp** - I swapped to using the `.requestReceivedTimestamp` field of the audit log to represent the audit log's timestamp since it's the timestamp when the request was received by the apiserver. The `.stageTimestamp` is used by the collection pipeline to calculate delays in the pipeline because that timestamp indicates when the audit log was generated by the apiserver.
- **User UID for user scope** - I swapped to using the user's UID as the filtering / sorting column when querying the audit log system through the user scope since the UID is the stable identifier for the user and is the value that's provided in the user's extra information.
- **Hourly timestamp buckets** - Updated all projections to use the same hourly time bucketing introduced in #23.

---

Relates to datum-cloud/enhancements#536
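As an illustration (the `user_uid` column name is an assumption), filtering on a non-empty UID doubles as a filter for human users, since certificate-authenticated internal components have no UID:

```sql
SELECT timestamp, user, audit_id
FROM audit_logs
WHERE user_uid != ''                        -- excludes internal components
  AND timestamp >= now() - INTERVAL 24 HOUR
ORDER BY timestamp DESC
LIMIT 100;
```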
Summary
Optimize Clickhouse database schema for platform-wide and user-specific querying of multi-tenant audit log data.
Details
As we began running performance tests against the activity apiserver, we noticed that platform-wide queries were performing drastically worse than tenant-level queries. See datum-cloud/enhancements#536 (comment) for a comparison.
This was a result of our initial schema being designed to order data by tenant resulting in platform-wide queries scanning the entire data set instead of being able to skip over irrelevant rows.
Performance improving changes
I've modified the `001_initial_schema.sql` migration instead of adding a new migration because this service has not been released yet.

Related changes

Removed `stage` from the schema and the querying interface since we're only collecting the `ResponseComplete` stage from the system.

Unrelated changes
Performance test results
These performance tests were executed against a 3-node Clickhouse cluster with each node allocated 8 CPU and 30Gi of RAM. The performance test ramps up RPS against the system while querying 24 hours of audit logs. We're collecting ~24M audit logs per day.
Previous Clickhouse schema
This shows a performance test that was run against the activity system that was focused on tenant-level querying. The graphs show that the activity api would struggle with a small number of platform-level queries (~4 RPS) and queries would immediately begin timing out.
New optimized Clickhouse schema
This performance test demonstrates the improvements that were made after the new schema was applied. We were able to top out at 96.5 RPS with a P99 response time of 2.5 seconds. No timeouts or errors occurred with the performance test.
Resources
Relates to datum-cloud/enhancements#536